During the two-year Relevance Assessment Project, 15 studies were designed
and carried out with over 500 subjects used as relevance judges. The subjects were 
librarians, information specialists, library science students and faculty, and graduate 
and upper division students in psychology. Materials for judging were selected for 
subjects according to their interests and backgrounds. Results of the many 
experiments conducted indicate that relevance judgments can be influenced by many 
factors including (1) the skills and attitudes of the judges used, (2) the documents
and document sets used, (3) the particular information requirement statements, (4) 
the instructions and setting in which the judgments take place, (5) the concepts and 
definitions of relevance employed in the judgments, and (6) the type of rating scale or 
other instrument used to express the judgments. Findings indicate that relevance 
scores must be used with caution for system evaluation. A list of articles and reports 
on the project is included. (CC) 
Before discussing our research, I'd like to comment that I am very glad to
be talking before an APA audience again, after an interval of 13 years.
The last time was in 1955, when I reported on a study that Dr. William
Albaugh and I had conducted on the communicability of clinical
psychology reports. I began thinking about that study again a few weeks 
ago as I tried to remember what APA conventions and APA audiences were 
like. 
When I left the field of psychology -- at least as an active worker -- some
twelve years ago, I didn't realize that I had any interest then in what
is now being called information science. Yet as I look back, I realize
that even then I was interested in the flow of information and in professional
communication. The study on psychological reports, which I associated only
with clinical psychology work in my distant past, suddenly became highly
relevant to my current work in the field of information science and
technology.
Thinking about that study also reminded me that there was a time when we
psychologists were noticeably thin-skinned about the profession.

One of the rather startling findings of our early study was that about 
half of the messages that clinical psychology report writers were trying 
to convey to the report readers either did not get through 
or got through in a highly distorted form. I undertook the study because
of a gnawing feeling that the psychiatrists and other readers of the
reports that I and other staff members and trainees were so painstakingly
creating might not really understand what we were saying.

*This is the final draft of a talk for the 1968 Annual Meeting of the American
Psychological Association, San Francisco, California.

We set about checking this hunch by developing a series of questionnaires 
for each of several psychological diagnostic reports, then asking 
psychiatrists, other physicians, nurses, staff psychologists, and psychology 
trainees to study the reports and indicate, by means of the questionnaire,
what the writer had told them.



Each questionnaire item was multiple choice and contained, in addition to
one choice that was almost directly lifted from the report or was a fairly
direct paraphrase, one or more wild, totally wrong interpretations of what
the author thought he had said. (The author, incidentally, provided the
criterion judgments for our test.) The result was about 50% "correct"
responses. The psychology staff did a little better than psychology
trainees, who in turn did a little better than student nurses and
psychiatrists.



The paper reporting these results was accepted for the APA convention
in San Francisco, and the APA publicity people planned to have
press releases and interviews. These were subsequently called off, because
of the feeling that the results of the study might be misinterpreted and
misused by those hostile to psychology. I have occasionally wondered
if half a generation of clinical psychologists since that time has continued
to turn out acres of reports that half of a generation of psychiatrists
still don't understand. (I hope, incidentally, that APA is more comfortable
and less thin-skinned about its image now than it was 12 years ago.)



All of this brings me circuitously to the subject of relevance. 
I should begin by trying to indicate why the study of relevance judgments
has seemed important to the field of information science. 

The large-scale use of data processing equipment and procedures in 
libraries and document-handling systems began in the 1950s. Within a few
years, the appreciation of a number of technical problems in such use,
together with the fairly sizable costs of automated or semiautomated
information retrieval systems, helped to awaken an active interest in
system effectiveness and in means of measuring it.

Almost from the beginning of serious work on the evaluation of information 
retrieval systems, which began in 1953 at the Library of Congress, attempts
to provide adequate criteria of evaluation were dependent on and bedeviled
by the concept of "relevance," relevance usually referring to a relationship
between some kind of information need and some kind of system output,
such as a document. The evaluation study in the Library of Congress proved
inconclusive, incidentally, at least in part because the evaluators and
the Library of Congress personnel could not agree on which documents were
really "relevant."

Evaluation studies generally followed the same pattern: the
holdings of a system were searched, in response to some kind of inquiry,
and a subsequent judgment was made as to whether the resulting outputs
were "relevant" to the inquiry. On the basis of these judgments, various
scores were computed to express the system's retrieval performance. After
some years, these scores began to be discussed by some workers with a great
deal of reverence. One such score was named "recall"; it refers to a
ratio of two numbers: the number of "relevant" documents produced by
a retrieval system over the total number of "relevant" documents actually
in the system's store. It was quite common to have people write or say such
things as "Information System X is performing at 80 percent recall, while
Information System B is performing at only 75 percent recall."
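
As a minimal illustration of the recall score just defined, the following
Python sketch computes the ratio for one inquiry. The document identifiers
and the function name `recall` are hypothetical, chosen only for
illustration; they are not drawn from any of the systems discussed here.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents in the store that the search produced."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined when no documents are judged relevant")
    return len(set(retrieved) & relevant) / len(relevant)


# Hypothetical example: 5 documents in the store are judged relevant,
# and the search output contains 4 of them plus one irrelevant document.
relevant_in_store = {"d1", "d2", "d3", "d4", "d5"}
search_output = ["d1", "d2", "d3", "d4", "d9"]

score = recall(search_output, relevant_in_store)
print(f"recall = {score:.0%}")
```

Note that the score depends entirely on which documents are placed in the
relevant set, which is itself the product of human judgment.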

Most of the workers engaged in retrieval system evaluation from 1953 to the 
present have had relatively little interest in relevance judgments per se.
They have been interested in them primarily as a criterion by which to
evaluate manual or computer-based searches, or comparisons between them. These
workers, of course, have been aware of disagreement among judges, but they 
have tended to consider such disagreement largely as an irritant, to be 
stamped out or bypassed as quickly as possible, rather than as a phenomenon 
worthy of interest in its own right. Thus, in spite of the reliance in 
system evaluation on the notion of a "relevant set of documents," the 
relevance process itself has largely been treated as a black box, and there
has been very little effort to understand either what goes on inside the box
or how variations in the judgments might lead to variations in the
identification of the relevant set of documents. This is somewhat analogous to a
situation in psychology where we used a test for the selection of personnel 
without knowing what the test measures, how the test items are interrelated, 
what factors cause variations in test scores, or what relationships individual 
test items and test dimensions have to specific aspects of job performance. 

Against this background, and against a backdrop of frustration and
disagreement about the validity and implications of evaluation studies involving
relevance, we began a project under support from the National Science
Foundation, to develop some empirical information about human judgments of relevance.
Our study at System Development Corporation began at the same time as a
companion project at Case Western Reserve University, which Dr. Schultz will
describe a little later.
In planning our project, we made two important methodological choices. First,
we focused on relevance as a relationship between a document and some public,
visible expression of an information need, to wit, a written statement 
describing an information need. Second, we chose to follow an experimental, 
laboratory approach, to permit greater control over the important variables. 

It was our feeling at the outset that we might eventually be dealing with 
dozens or scores of variables, each of which could be measured in many ways.
It seemed important, therefore, to operate in circumstances that would help 
us to see as clearly as possible the effect of particular variables and the 
usefulness of particular measures. 

Prior to any experimentation, the Project staff developed a list of variables 
that might be contributors to variations in relevance judgments. Since there 
was little empirical evidence related to any of these, the list was based almost 
entirely on a priori considerations. Groups of variables relating to five
aspects of relevance judgments were identified: (1) Documents, (2) Information
Requirement Statements, (3) the Judge, (4) the Judgment Conditions, and (5) the
Available Mode of Expression. Within these groups a total of 38 variables was
listed, as shown on the second page of the handout.



During the two years of the project, we examined almost half of the 38 variables 
on our list. Fifteen studies were designed and carried out, using over 500 
subjects as relevance judges. The subjects were librarians and information
specialists, library science students and faculty, and graduate and upper
division students in psychology. Materials for judging were selected and/or
created in accordance with particular experimental objectives and the
backgrounds of the judges.
The attempt to look at almost 20 variables was premised on the fact that
many of these variables were likely to be related to each other. Any single
experiment (in fact, any relevance-judging situation of any kind) must use a
particular set of documents or representations, a particular set of
information requirement statements, particular judges, particular judgment conditions,
and particular modes of expressing the relevance judgments. Yet each of these
is itself a potential source of variability. Therefore, one cannot generalize
the results of any single experiment, no matter how well it might be done,
because the influence of other variables may not be known. For this reason,
it was methodologically preferable to attempt a first-round assessment of many
variables, rather than an intensive study of a single variable or small group
of variables, in isolation.



There are several detailed reports on our studies, noted on the last page of 
the handout, and I won’t attempt to summarize them here. I would like to 
mention the findings from one of the studies having to do with the negotiation
process between an information user and a librarian or information specialist.



This study looked at what we call the "implicit use orientation" of the user. 
By use orientation, we mean the user’s expectation regarding the way in which 
he will use the information. For example, he may be trying to compile an 
exhaustive bibliography; or to identify articles that contain specific bits 
of information of some immediate practical use; or to get articles that serve 
no particular practical use but may have idea-stimulating value. You have
been exposed to and experienced such orientations; what we wanted to do was
see whether they influence judgments of document relevance.
To do this, we exposed about 150 judges to a set of 9 documents and 8
information requirements statements. Then, after eliciting relevance judgments, we
got a second set of judgments under different conditions. For the second
set of conditions, we, in effect, told the judges more about the users and
their use orientations, and we asked the judges to consider themselves to be
acting as agents for the users as they made their judgments on the documents
and information requirements statements. For this part of the experiment, we
divided our judges into 14 groups, each of which was given a different use
orientation. We then compared the resulting data with the judgments that the
same judges had made earlier, without any special use orientation. The results
showed that each of the 14 use orientations we imposed altered the relevance
scores that the judges assigned to documents. A document that would be
accorded high relevance for a bibliography orientation might be given low
relevance for some other kind of orientation.

What the study showed, among other things, is that relevance scores are very 
slippery. Documents clearly have no inherent, unchanging relevance to
information requirements statements; the relevance values attributed to them
really depend, in part, on how the documents are going to be used.
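
The consequence for evaluation scores can be sketched in a few lines of
Python. The judgments below are invented purely for illustration (they are
not data from the project), and the names `recall`,
`judgments_by_orientation`, and the two orientation labels are hypothetical.
The sketch assumes one fixed search output and shows that the recall computed
for it changes when the relevant set is defined by judges working under
different use orientations.

```python
def recall(retrieved, relevant):
    """Relevant documents retrieved, divided by all relevant documents."""
    relevant = set(relevant)
    return len(set(retrieved) & relevant) / len(relevant)


# One fixed search output from a hypothetical system.
search_output = {"d1", "d2", "d3", "d4"}

# Invented relevant sets for the same inquiry, as judged under
# two different (hypothetical) use orientations.
judgments_by_orientation = {
    "exhaustive bibliography": {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"},
    "immediate practical use": {"d2", "d3", "d5"},
}

scores = {
    orientation: recall(search_output, relevant)
    for orientation, relevant in judgments_by_orientation.items()
}

for orientation, score in scores.items():
    print(f"{orientation}: recall = {score:.2f}")
```

The same search output scores differently in each case, although nothing
about the system itself has changed.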



This was only one of many experiments undertaken during our two years of work. 
The results from all the studies, taken together, show that relevance judgments 
can be influenced by many factors: the skills and attitudes of the particular
judges used, the documents and document sets used, the particular information
requirement statements, the instructions and setting in which the judgments
take place, the concepts and definitions of relevance employed in the judgments,
and the type of rating scale or other instrument used to express the judgments.
I don't think these kinds of findings would surprise many people trained in
psychology. We have all been trained to think of behavior in terms of many
variables, some of them highly complex and obscure. Yet the kind of research
I have described is relatively new in the information science field and even
though psychologists would fully expect experimental results to hinge on the
kinds of judges, documents and other variables involved in the experiment,
neither they nor anyone else has been in a position to say how these variables
behave in the document-judging situation.



From the standpoint of system evaluation, which was our orientation at the
outset of the project, our findings cast serious doubt on the unquestioning
use of relevance scores as stable criteria for system or subsystem evaluation,
because these scores are likely to be artifacts of particular systems and of
the particular conditions of relevance measurement. Thus, they may not
deserve the aura of quantification, validity and stability that they currently
enjoy. Our findings also suggest that the use of single figures of merit
(for example, "our system has 80 percent recall") can be quite misleading in
comparisons between different information systems or, indeed, under any
circumstances where the sources of variability mentioned above have not been taken
into account and controlled. Moreover, even if one were to develop stable and
meaningful figures of merit for information systems, what does one then do to
improve the system? It is obvious that system improvement rests not on overall
figures of merit but on sensitive diagnostic information on particular aspects
of system performance, such as Dr. Katter discussed in his paper. The importance
of relevance work is not that it will provide better figures of merit, but that
it will help us to understand better the interface between the information
system, on the one hand, and the user or intermediary, on the other. Such
understanding is an absolute requirement for effective diagnosis.
Our studies have provided results that I know are frustrating to some workers
in the information science field, and there has been some feeling expressed
that relevance judgments are not only suspect but unworthy of study; therefore
we should dispense with them. That, to me, is a very cheap way out of a
predicament. You will recall that many years ago, in the field of psychology,
there was a widespread revolt against subjective phenomena, in part because
of the same frustrations some of us have experienced with relevance. The
outcome of the revolt was that, for a time, psychologists devoted their attention
not to what was important, but to what was measurable. This is a surefire 
approach to a certain kind of respectability, which information scientists now 
desire as much as psychologists did then, but it risks losing the baby with the 
bathwater. I believe that, when information scientists fully accept the fact 
that relevance phenomena are complex and slippery, they should not take the 
easy way out of simply turning their backs on such phenomena. Relevance
judgments, however disguised and however renamed, are indispensable aspects of 
our field, and part of the challenge is to admit their complexity, to start 
trying to learn what they are about, and to begin building better, and less 
elastic, rulers to measure them. 